Abstract

Although α-helices and β-sheets dominate the composition of proteins, other secondary structures find their place there too. 3₁₀-helices and π-helices are two such rare secondary structures. There is very little objective insight into the statistical aspects of their occurrence. A comprehensive set of reasons for the existence of 3₁₀ and π-helices can be obtained only if their occurrence profile on the primary structure is unambiguously described from the perspectives of sequence, structure and evolution. Although studies of the compositional and energetic profile of 3₁₀ and π-helices are not uncommon, these alone do not tell us why such (rather unstable) structures are found in proteins in the first place. Considering all the non-redundant protein structures across all the major structural classes, the present study attempts to find the probabilistic distributions that describe several facets of the occurrence of these rare secondary structures in proteins. Structural causes for the observed statistical patterns are explained too. Probabilistic profiling of the occurrence of 3₁₀ and π-helices reveals that their presence follows a Poisson flow on the sequence. Thorough statistical analysis of sequence intervals between consecutive occurrences of 3₁₀ and π-helices supports this finding. With extensive analysis from varying standpoints we prove here that such Poisson flows suggest the 3₁₀-helices, and especially π-helices, to be evidence of nature's mistakes on folding pathways. This hypothesis is further supported by the results of a critical evolutionary analysis on 20 major protein domain families, which reveal the definitive trend that proteins try to dispose of these structures during evolution, in favor of α-helices. Alongside these unexpected and significant results, a new algorithm to differentiate between related sequences is proposed here that reliably studies evolutionary distance with respect to protein secondary structures. Building upon a firm foundation of structural information, this study attempts to address protein evolution from the perspective of secondary structures, within a rigorous statistical and algorithmic framework.
Introduction :

3₁₀-helices are enigmatic characters. The possibility of their existence was discussed a good eight years before Pauling proposed the structure of the α-helix[1,2]. But the number of studies on them is minuscule compared to that on α-helices. Moreover, even though the amassed information is not prolific, a systematic survey of it might take any student of structural biology for a bumpy ride, because works on them often contradict each other, leaving us with a (rather) confusing picture.
To begin from the beginning, we jot down what is known about 3₁₀-helices. They have 3 residues per turn (that explains the name), a translation of 2 Å along the helix axis, and a hydrogen bonding pattern [i → (i + 3)]. About 3% of residues are known to exist in 3₁₀-helices (compared to about 32% of residues in α-helices) [3-5]. 3₁₀-helices are irregular in shape with an average length of 3-4 residues; finding one with more than 6 residues is extremely rare[4]. In contrast, α-helices adopt long, regular helical structures in proteins. This marks, to quite an extent, the end of the range of consensus and the beginning of the sphere of claims and counter-claims. For example, to describe 3₁₀-helices in torsion angle space, the ideal assessment was (-60°, -30°)[3]. In reality, one study [4] asserted that the preferred mean φ-ψ value for them is (-71°, -18°), another[6] claimed it to be (-50°, -30°) (apparently, a more precise variant of it [7] read (-49°, -26°)), according to another [8] it is (-63°, -17°), while according to still another [9] it is (-68°, -17°); and for 3₁₀-helices containing non-native amino acids the preferred φ-ψ value was found to be (-54°, -35°)[10]. Moving on to the energetic perspective, while a number of theoretical and computational studies asserted that α-helices are energetically more stable than 3₁₀-helices[11-15], analysis of ESR (Electron Spin Resonance) spectral data [16,17] pointed to a coexistence of 3₁₀ and α-helices. Although it is easy to understand that initiation of helix formation will be easier for a 3₁₀-helix than for an α-helix (one fewer unit to consider before the formation of the first hydrogen bond), it is found from the literature [18] that it is not 3₁₀-helices but "only" α-helices, β-sheets, or short covalently bridged cycles (as in conotoxins or in metallothioneins) that can serve as nuclei for initiating protein folding. Easy melting of 3₁₀-helices at lower temperature[19] serves as another proof of the unstable nature of their structures.

Variation in the φ-ψ magnitude of α-helices can also be detected easily; however, such variations are always limited to a small area of the 'Ramachandran map', clearly implying a deterministic, well-cut-out plan of nature to assign categorical significance to the [i → (i + 4)] hydrogen bonding pattern. Remarkably, as shown later in this work, despite sharing the same energy minima with α-helices in φ-ψ space [20] and possessing an α-helix-like CD spectrum[21], the φ-ψ values of 3₁₀-helices populate all possible nooks and crannies of the 'Ramachandran map'. The considerable extent of φ-ψ variation in these helices prompted us to explore the possibility of regarding them as some kind of ad-hoc structures that were used while making the primary structure fold, but to which no well-cut-out scheme of significance was attached, as it was with α-helices. A previous work [3] had studied particular causalities behind unexpected torsion angles in 3₁₀-helices, but from a bottom-up perspective (treating amino acids one by one), and it therefore could not provide a general answer to the simple question "why are they there in the first place?". Many studies (on helix-coil transition [22], helix nucleation [3], transitions between 3₁₀ and α-helical conformers in domain motions ([23], [24]), and molecular dynamics based work observing the [i → (i + 3)] hydrogen bonding pattern in helical peptides[25],[26]) have described the role of 3₁₀-helices. But careful inspection reveals that all of them merely tried to justify the existence of these helices in proteins. None of them could explain what proteins would have missed if 3₁₀-helices were not there (the analogous question on α-helices would have been an easy assignment for a freshman structural biology student; but not this one).

A rigorous literature survey could dig up works reporting the presence of a 3₁₀-helix in a groove formed between two α-helices within a protein[27], their (possible) role in protein-protein interaction [28, the English abstract of a Japanese article] and in motif formation[29]. Some other studies revealed the presence of single 3₁₀-helices in active sites [30,31] and the possible role of a single 3₁₀-helix in RNA-binding[32] and receptor-binding[33]. However, an immensely interesting trend could be observed in the way the "roles" of 3₁₀-helices were reported in these studies. Expressions like "the larger flexible loop includes one turn of a 3₁₀-helix that comprises the binding site .."[30], or ".. a short 3₁₀-helix, found immediately N-terminal to the first β-strand in RRM1, may interact with RNA directly" [32], or "the active site cysteine lies in a cleft formed by a coil region that includes the 3₁₀-helix and a loop .."[31], or "a possible physiological role of the 3₁₀-helix present in G-CSF for its receptor binding activity"[33] strongly indicate the predominant trend of 3₁₀-helix research. For a student of structural biology, it becomes a little difficult to assign great importance to "one turn of a 3₁₀-helix" or "a short 3₁₀-helix" (when we know that their average length is 3-4 residues[4]) in rare and isolated occurrences that do not follow a general pattern across the universe of proteins. On top of that, the use of words like "may" and "possible", and the observation of a 3₁₀-helix occurring alongside a loop within a coil, suggest to the student that the aura of "importance" attached to the 3₁₀-helices might be a little too amplified.

The picture of 3₁₀-helix studies from the realm of helix-coil transition and (separately) from helix formation studies is as eminently confusing as it is from any other sphere, with two notable differences. First, many of these studies involve peptides (especially those of α-aminoisobutyric acid (AIB)). Second, the minute aspects of [i → (i + 3)] and [i → (i + 4)] hydrogen bond studies depend on the exact definition of a hydrogen bond used in the calculation. It is interesting to note that AIB is not a proteinogenic amino acid (in fact, it is rarely found in nature), but in oligomeric form it readily forms 3₁₀-helices. To what extent a hypothesis constructed out of AIB studies is relevant in the complex domain of proteins is a rather philosophical question, and we refrain from commenting on it. However, the very fact that proteins are much more complex and enormously more diverse machines than peptides can never be questioned. Studying the transition phenomenon with various peptides and (some) proteins, some works [22,16,34] claim the formation of 3₁₀-helices to be a necessary step in α-helix formation, while others[20,35], drawing upon the results of double-label ESR and NMR studies on Ala-rich peptide sequences, claim a coexistence of α and 3₁₀-helices. The [i → (i + 3)] and [i → (i + 4)] hydrogen-bonding studies faithfully reproduce this conundrum. In a study [36] of exceptional relevance to the present one, it was confirmed that [i → (i + 3)] bond formation does take place while helix formation and denaturation were studied with synthetic Ala-based peptides; but the same work failed to observe the formation of complete 3₁₀-helices. Another work [37] reported the "curious" case of [i → (i + 3)] bond formation amidst [i → (i + 4)] bond breaking while exploring alanine-based peptides with MD. In an expression reminiscent of the ones presented in the last paragraph, "transient turn structures with [i → (i + 3)] hydrogen bond" were reported [38] while studying the unfolding of an 18-residue peptide with 1 ns MD. But tackling the problem the other way round, another study asserted that formation of 3₁₀-helices is not a necessary step in the transition from coil to helix[39]. Taken as a whole, this is a perfectly democratic coexistence of all kinds of contradicting findings, with no clear picture.

Thus, to summarize it all: first, we could not find a single work where a 3₁₀-helix has been identified to perform either a structural or functional role which a (small) α-helix at the same coordinate would have failed to perform. In fact, a whole slew of studies point out how unstable they are. Second, the population of 3₁₀-helices in all possible corners of the Ramachandran map (a cursory glance at the 3₁₀ φ-ψ space (fig.-1a) reveals the presence of points like (-60°, +140°), (+70°, +30°), (+100°, -150°), (-40°, -40°), with two remarkable cases at about (+170°, +170°) and (+90°, -90°)) suggests categorically that nature does not always prioritize ensuring local, locally-global or global energy minima for the 3₁₀-helices. (The same for π-helices is presented in fig.-1b.) In fact, as a recent study [40] pointed out, since there are position-specific shifts in the φ-ψ of 3₁₀-helices in proteins, attempting to describe structural invariants in them with the φ-ψ construct is not tenable at all. Third, although the role of 3₁₀-helices in the formation and uncoiling of α-helices has been pointed out previously ([20], [23]), no study has ever proved conclusively that the presence of 3₁₀-helices is obligatory in either the formation or the uncoiling of α-helices. Fourth, "one turn of a 3₁₀-helix" or "a short 3₁₀-helix" might at times be associated with some role in either protein structure stabilization or protein function; but the overwhelming absence of that trend across the protein universe clearly implies that such singular occurrences are mere accidents and not part of the general (efficient) mechanism that characterizes the functioning of nature. Hence, for an observer of protein reality who relies purely on the data, it might not be unreasonable to consider 3₁₀-helices as pure liabilities from a structural perspective. Although one study [41] reported the occurrence of 3₁₀-helices on protein surfaces, that alone does not establish their possible role in protein function. We could not find a study that shows a statistically significant number of cases (or even five or ten) where a 3₁₀-helix performs an extremely important protein function that could not have been performed by a small α-helix.



To find an answer to our simple yet fundamental question, "why are they there?", we chose to define the problem from a statistical standpoint. To be precise, we wanted to characterize the statistical pattern of occurrence of 3₁₀-helices on the primary structures of all the protein structures of the non-redundant PDB[42], ranging across all the structural domains of SCOP[43]. The objective and unambiguous nature of a statistical distribution of their occurrence profile will provide an ideal framework to study other outstanding issues related to 3₁₀-helices. To explore the possibility of a latent periodicity or hidden pattern in the (rare) occurrences of 3₁₀-helices, a rigorous mathematical survey of the separation between individual occurrences of 3₁₀-helices on the sequence axis, across all the SCOP classes, was carried out.



As one scans through the primary structure of a protein, certain characteristics in the mode of appearance of 3₁₀-helices stand out prominently. To describe the situation objectively, suppose that at the coordinate $s_1$ of the primary structure the observer notices a 3₁₀-helix. The observer notices the next 3₁₀-helix at another point $s_2$, then at $s_3$, and so forth. Although the energetic details of the co-existence of 3₁₀-helices and α-helices have been reported in some previous studies ([20], [23]), it is not biophysically viable to assume that every 3₁₀-helix must lie in the vicinity of some α-helix in the primary structure. In our dataset we observed that these rare secondary structures can occur anywhere in the structure: in the vicinity of α-helices and sheets, as well as independent of any other regular secondary structure. Hence, while considering the occurrences of 3₁₀-helices on the primary structure, we described them as they are (i.e., without resorting to α-helices to describe the coordinate of a 3₁₀-helix). With such a construct to scan the sequence, the observer might notice some extremely interesting facts, namely :
The biophysical details : 

B.1) Very few residues fall into the class of 3₁₀-helices (a previous study [5] observed that only 4% of the residues in its select database could be classified into 3₁₀-helices).
B.2) Even within the helix population, only ~20% of all protein helices adopt the 3₁₀-helix conformation [20]. The number of π-helices is even smaller. (On a related note, although a previous work [4] had reported the presence of 3₁₀-helices in all-β proteins, we failed to find a single instance of such an occurrence.)

The mathematical details : 

M.1) No periodic (or other global) pattern could be detected in the occurrence profile of 3₁₀-helices.
M.2) The probability of two or more 3₁₀-helices occurring at the same coordinate of the primary sequence is zero (hence the occurrence profile of the 3₁₀-helices constitutes an ordinary flow of events).
M.3) (Related to M.1) The chance of not finding a 3₁₀-helix for $s$ units of sequence during a search is the same as that of a fresh search that fails to find a 3₁₀-helix in the next $s$ units of sequence. In other words, past history (that is, the search conducted along the sequence up to the point the search has reached) has no effect on finding or not finding a 3₁₀-helix. Mathematically :

\[ P(S > s_i + s \mid S > s_i) = P(S > s), \]

where $s_i$ denotes the coordinate of the sequence at which the search operation is halting at present and $s$ is the distance, measured in sequence units, over which the search for a 3₁₀-helix is being carried out.

This property suggests that the occurrence of 3₁₀-helices on the primary structure possesses the 'memoryless property'. We note further that a distribution that attempts to describe the occurrence profile of the 3₁₀-helices should be able to describe the amount of sequence length one needs to scan before the event of detecting another 3₁₀-helix occurs. At the same time, this distribution should be continuous in nature.

M.4) (Related to M.1 and M.3) Differences in the occurrence profile of 3₁₀-helices on the sequence axis ($s_2 - s_1$, $s_3 - s_2$, ..., $s_{i+1} - s_i$, ...) are stochastically independent for any natural $i$.
Denoting the event of detection of a 3₁₀-helix by $D_{s_i}$, we can generalize M.4 to describe the stochastic independence of the differences in the occurrence profile of 3₁₀-helices on the sequence axis by asserting that

\[ (D_{s_1} - D_{s_0}),\ (D_{s_2} - D_{s_1}),\ \ldots,\ (D_{s_i} - D_{s_{i-1}}) \]

are independent for any natural $i$ and, further, for every $0 \le s_0 < s_1 < \ldots < s_i$.

Together these properties (M.2 and M.1, M.3, M.4, along with B.1 and B.2) imply that there exists a real parameter $\lambda$ such that this holds true for every $i$ on the distribution of $s_i$; as a whole, this describes a Poisson distribution with parameter $\lambda s$, where $s$ denotes the particular length on the sequence axis between two consecutive 3₁₀-helices.

In fact, a distribution endowed with the characteristics and requirements stated above can be considered a special class of continuous probability distribution, one that describes the sequence intervals between the detection of two consecutive 3₁₀-helices. The detections occur independently yet continuously at a constant average rate on any sequence under consideration, a hallmark of Poisson processes. The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous Poisson process [44-46].
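The link between a homogeneous Poisson flow, exponentially distributed gaps and the memoryless property (M.3) can be illustrated with a small simulation. This is only a sketch on synthetic data: the rate of 0.05 helices per residue, the function names and the sample size are hypothetical choices made for illustration, not values taken from this study.

```python
import random


def simulate_gaps(rate, n_events, seed=0):
    """Draw n_events inter-arrival gaps of a homogeneous Poisson flow."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n_events)]


def survival(gaps, s):
    """Empirical P(S > s): fraction of gaps longer than s."""
    return sum(g > s for g in gaps) / len(gaps)


def conditional_survival(gaps, s_i, s):
    """Empirical P(S > s_i + s | S > s_i)."""
    survivors = [g for g in gaps if g > s_i]
    return sum(g > s_i + s for g in survivors) / len(survivors)


rate = 0.05  # hypothetical intensity: one helix per 20 residues on average
gaps = simulate_gaps(rate, 200_000)
mean_gap = sum(gaps) / len(gaps)

print("mean gap:", round(mean_gap, 2))  # expected to be close to 1 / rate
print("P(S > 10):", round(survival(gaps, 10.0), 3))
print("P(S > 25 | S > 15):", round(conditional_survival(gaps, 15.0, 10.0), 3))
```

If the flow is memoryless, the last two printed values should agree to within sampling noise, and both should lie near exp(-rate * 10).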

Hence, to define our problem categorically, we attempt to construct a counting algorithm that scans through the primary structure and tries to detect the 3₁₀-helices, before verifying whether or not their occurrence profile follows a mathematical pattern. The search for any particular 3₁₀-helix takes place on a random length of sequence of extent $S$ (that is, the length of the entire primary structure under consideration). The searching operation has an exponential distribution with parameter $\mu = 1/\beta$, where $\beta$ denotes the actual number of occurrences of the 3₁₀-helix in $S$. On the basis of the observations discussed above, we accept that the appearance of 3₁₀-helices follows a Poisson flow with intensity $\lambda$ ($\lambda$ being the average number of detections of a 3₁₀-helix per unit traversal of sequence length; thus $\lambda = \lambda(s)$. The constancy or variability of $\lambda$ will be determined by the sequence under consideration). Designing a foolproof detection scheme to identify 3₁₀-helices is extremely difficult. Hence, we introduce a parameter $p$ that describes the probability with which the counting scheme can detect the 3₁₀-helices. Finally, we designate a random variable $X$ to describe the number of recorded 3₁₀-helices from a given primary structure. In this work, we propose to find its distribution and the corresponding characteristics of mean ($m_x$) and variance ($\mathrm{Var}_x$), before verifying these theoretical predictions against the actual profile of occurrence of rare secondary structures. Once functional, on the utilitarian front, the present algorithm can help researchers estimate the number of 3₁₀-helices when the sequence information is provided. On the theoretical side, the ramifications are deep. Success of this algorithm will imply that our interpretation of the φ-ψ distribution (fig.-1) of 3₁₀-helices was correct. Perhaps even more importantly, it will suggest that the occurrence profile of 3₁₀-helices on the sequence can indeed be described reliably with a Poisson distribution, which in turn will imply that the occurrence of 3₁₀-helices on the sequence is a random event that takes place purely 'by chance' (the hallmark of a Poisson process).

A spectrum of results from various types of statistical surveys provided us with reasonable indications that our hypothesis regarding the Poisson nature of the occurrence profile of 3₁₀-helices (and its implications) is, to a large extent, true. If so, one would expect that during evolution proteins must somehow try to dispose of the 3₁₀-helices. To put this hypothesis to a strict and critical test, a thorough study of PDB structures classified by Pfam (release 24.0) [47] was undertaken. Pfam is a database based on profile hidden Markov models (HMMs), which combines high-quality and complete protein domain families with high-quality alignments. We selected sequences from the top 20 Pfam families for which the structures are known. We wanted to observe the changes in the parts of the sequence having the 3₁₀ and π structural states in pairs of proteins within a family. Since such changes should be more clearly revealed in proteins with different evolutionary distances within a family, Pfam provided an ideal framework for such an analysis.

As a part of the phylogenetic analysis, an extensive computational analysis was conducted to count substitutions of a rare secondary structure (RSS) in the sequence space. Although the literature of contemporary biology abounds with sequence-based evolution studies, a rigorous work that unambiguously lays down the criteria for characterizing substitutions with the help of evolutionary distance and alignment information was difficult to find. Most studies on protein evolution have considered either the complete sequence, or the complete 3D structures or domains. This leaves us with little insight into how secondary structures evolve and, on the other hand, how the changing profile of structural parameters of secondary structures contributes to the evolution of proteins as a whole. We propose a methodology to achieve this in the present work. It is worth mentioning here that our approach differs notably from previous studies in a similar paradigm [48-53], not only in its basic motivation but also in its implementation.

Evolutionary analyses were carried out on all the protein sequences drawn from the twenty most prominent Pfam families. An elaborate set of comparisons was undertaken to observe the evolutionary trends in them, from the perspective of secondary structural elements. Although we note the previous attempts to investigate this (rather difficult) paradigm [50, 54-56], our approach was different both in its motivation and in its implementational details. In a two-part scheme, we started by describing the evolution of the aforementioned primary structures with respect to : a) position-specific retention of 3₁₀-helices, b) position-specific replacement of 3₁₀-helices, and c) position-specific introduction of new 3₁₀-helices. In the next stage, the cases of position-specific replacement of 3₁₀-helices were surveyed across the twenty most prominent Pfam families, to obtain an idea of the proportion of 3₁₀-helices transformed to α-helices. Since this analysis involved seven major structural classes of protein domains (SCOP), the present study could capture the entire paradigm from primary structures to protein domains, from the standpoint of secondary structures along evolution.

π-helices (characterized by [i → (i + 5)] hydrogen bonding between residues) are even rarer in proteins than 3₁₀-helices[7, and references therein]; the number of studies on them is smaller too. The reasons for their energetically unfavorable structure are described masterfully in a previous work[7]. All our aforementioned assertions about 3₁₀-helices can equally be extended to π-helices on the primary structure. Hence exactly the same methodology is applied to investigate the occurrence profile of π-helices. In fact, if the mathematical and biophysical prerequisites are satisfied under certain biological contexts, one can even generalize the present set of reasons and algorithms to the entire set of other rare structural patterns (poly-Proline helix, β-bulge, α-turn, β-turn, γ-turn, π-turn, ω-loop etc.). In the present work, though, the results and implications of the present algorithm are discussed with respect to its application to only 3₁₀ and π-helices on the primary structure.

Materials (Dataset collection and classification:) : 

We obtained our dataset of protein structures from the Protein Data Bank. All the structures were obtained using the advanced query feature of PDB, and the domain sequences were classified into different classes according to the structural classification proposed by SCOP (Structural Classification of Proteins). SCOP is a database of protein domains; accordingly, we limited our analysis to domains only, and entire protein chains were not considered. Only true classes in SCOP were taken into consideration (all-α, all-β, α + β, α/β, membrane, multidomain and small proteins). Analysis was performed in a SCOP class-specific manner, to identify variabilities due to differences in the distribution of 3₁₀ and π-helices in different protein folds. However, since the number of π-helices is small and statistically insignificant in any one SCOP class, class-specific analysis could not be performed for π-helices. We used relaxed criteria (70% sequence identity, crystal structure resolution better than 3.5 Å) while collecting the sequences in any domain class, to avoid missing out on any piece of information.

In each SCOP class, 3₁₀ and π-helices were identified using the DSSP algorithm, since PDB assignments are known to be subjective and incomplete[57].
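As an illustration of how helix positions might be read off a DSSP assignment, the sketch below scans a per-residue string of DSSP state codes, in which `G` marks a 3₁₀-helix, `I` a π-helix and `H` an α-helix. The function name, the use of `-` for unassigned residues and the example string are hypothetical; running DSSP itself and parsing its output file are outside the scope of this sketch.

```python
def helix_segments(dssp_states, code):
    """Return (start, end) pairs (0-based, inclusive) for each run of `code`."""
    segments = []
    start = None
    for i, state in enumerate(dssp_states):
        if state == code and start is None:
            start = i  # a new run begins here
        elif state != code and start is not None:
            segments.append((start, i - 1))  # the run ended at the previous residue
            start = None
    if start is not None:  # run that reaches the end of the chain
        segments.append((start, len(dssp_states) - 1))
    return segments


# Hypothetical DSSP state string for a short domain.
states = "--HHHHHHH--GGG---EEEE--GGGG--IIIII--"

three_ten = helix_segments(states, "G")  # 3-10 helices
pi_helices = helix_segments(states, "I")  # pi-helices
print(three_ten, pi_helices)
```

The start coordinates of the returned segments play the role of the sequence coordinates of successive helix occurrences used in the statistical analysis.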



Methodology : 

Section -1) : Mathematical backbone : 

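For the purpose of detecting 3₁₀ and π-helices, we denote a random continuous part of the entire length of the primary structure ($S$) by $s$. We proceed to find the conditional probability that $X = \gamma$ ($\gamma = 0, 1, 2, \ldots$).

Thus, banking on the aforementioned background, we have :

\[ P\{X = \gamma \mid s\} = \frac{(\lambda p s)^{\gamma}}{\gamma!}\, e^{-\lambda p s} \]

Hence, the total probability of the event $\{X = \gamma\}$ is given by :

\[ P\{X = \gamma\} = \int_0^{\infty} \frac{(\lambda p s)^{\gamma}}{\gamma!}\, e^{-\lambda p s}\, \mu e^{-\mu s}\, ds = \frac{\mu (\lambda p)^{\gamma}}{\gamma!} \int_0^{\infty} s^{\gamma} e^{-(\lambda p + \mu) s}\, ds = \frac{\mu}{\lambda p + \mu} \left( \frac{\lambda p}{\lambda p + \mu} \right)^{\gamma} \]

This is a geometric distribution with parameter $\frac{\mu}{\lambda p + \mu}$, and therefore the mean and variance of this distribution are given by :

\[ \bar{\gamma}_x = \frac{\lambda p}{\mu} \qquad (1) \]

and,

\[ \mathrm{Var}_x = \frac{\lambda p\,(\lambda p + \mu)}{\mu^{2}} = \frac{\lambda p}{\mu} \left( \frac{\lambda p}{\mu} + 1 \right) = \bar{\gamma}_x\,(\bar{\gamma}_x + 1) \qquad (2) \]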
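The mean and variance of the geometric law, read here as λp/μ and (λp/μ)(λp/μ + 1), can be checked numerically. The sketch below is a Monte Carlo check under stated assumptions: the values of `lam`, `p` and `mu`, the function names and the sample size are all hypothetical and serve only to exercise the formulas. It draws a random scanned length from an exponential distribution with parameter μ, lays a Poisson flow of intensity λ over it, and records each event with probability p.

```python
import random


def recorded_count(lam, p, mu, rng):
    """One draw of X: number of helices recorded on a random stretch of sequence."""
    s = rng.expovariate(mu)  # random scanned length, exponential with parameter mu
    position = rng.expovariate(lam)  # first event of the Poisson flow
    count = 0
    while position <= s:
        if rng.random() < p:  # each event is recorded with probability p
            count += 1
        position += rng.expovariate(lam)  # move to the next event
    return count


def sample_moments(lam, p, mu, n=100_000, seed=1):
    """Sample mean and unbiased sample variance of X over n simulated scans."""
    rng = random.Random(seed)
    xs = [recorded_count(lam, p, mu, rng) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, var


lam, p, mu = 0.05, 0.8, 0.01  # hypothetical parameter values
mean, var = sample_moments(lam, p, mu)
predicted_mean = lam * p / mu  # eq. (1)
predicted_var = predicted_mean * (predicted_mean + 1)  # eq. (2)

print("simulated:", round(mean, 2), round(var, 2))
print("predicted:", round(predicted_mean, 2), round(predicted_var, 2))
```

Agreement between the simulated and predicted moments supports the reconstruction of the two formulas; it says nothing about whether real helix occurrences follow the model.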
Section -2) : Algorithmic Implementation :

Section 2.1) Statistical analysis based on torsion angle studies :

Section 2.1.1) Algorithm to (statistically) characterize the occurrence profile of 3₁₀ and π helices :

The crux of the algorithmic implementation of eqn. 1 and eqn. 2 rested on reliable detection of 3₁₀-helices. For this purpose, an inclusive criterion was chosen as the first step, only to be revised by a restrictive criterion in the second. A torsion angle range that suitably covers most of the 3₁₀-helices of the data-set was considered. Owing to the immense diversity of torsion angles in 3₁₀-helices, a fairly large range, φ : (-40° to -90°) and ψ : (-25° to -55°), was considered. A run of at least 3 consecutive residues in the aforementioned torsion angle range served as the filtering criterion (since most 3₁₀-helices are 3 or more residues long).
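The filtering criterion just described (at least 3 consecutive residues whose φ and ψ both fall within the stated ranges) could be sketched as follows. The function names and the example torsion angles are hypothetical, and the sketch assumes every residue has defined φ and ψ values, which is not true for chain termini in real structures.

```python
PHI_RANGE = (-90.0, -40.0)  # degrees, range stated in the text
PSI_RANGE = (-55.0, -25.0)  # degrees, range stated in the text
MIN_RUN = 3  # minimum number of consecutive residues


def in_range(phi, psi):
    """True if a residue's (phi, psi) lies inside the candidate window."""
    return (PHI_RANGE[0] <= phi <= PHI_RANGE[1]
            and PSI_RANGE[0] <= psi <= PSI_RANGE[1])


def candidate_runs(torsions, min_run=MIN_RUN):
    """Return (start, end) pairs (0-based, inclusive) of qualifying runs."""
    runs = []
    start = None
    # A sentinel outside the window closes a run that reaches the last residue.
    for i, (phi, psi) in enumerate(list(torsions) + [(180.0, 180.0)]):
        if in_range(phi, psi):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    return runs


# Hypothetical (phi, psi) values for eight residues.
torsions = [(-120, 130), (-60, -30), (-71, -28), (-50, -30),
            (-63, -17), (-65, -40), (-80, -45), (60, 40)]
print(candidate_runs(torsions))
```

In this example residues 1 to 3 form a qualifying run, while residues 5 and 6 lie in range but are too few to count.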

Denoting the actual number of occurrences of 3₁₀-helices in an arbitrarily long $s$ in $S$ as $\gamma$, and defining $\mu = 1/\gamma$, both $\gamma$ and $\mu$ were calculated. Since occurrences of 3₁₀-helices are rare and highly nonuniform, considering the entire sequence $S$ in one go to detect the pattern in their occurrence might have introduced various coarse-graining errors. Hence $S$ was sub-divided into overlapping shorter sequences $s$, where each $s$ is statistically significant (if $|S| = n$ and the statistically significant length of $s$ is $r$ ($r > 32$), then $(i = 0) \to r$, $(i = i + 1) \to r + 1$, ..., $(i = n - r + 1) \to n$). In this manner, the entire sequence could be scanned, taking into consideration the (possible) local bias that may be associated with 3₁₀-helix occurrence. Calculations were repeated with various magnitudes of $r$, to identify the (possible) latent bias. To not miss out on the overall picture, $\lambda = (M/|S|)$ is calculated, where $\lambda$ denotes the average number of detections of 3₁₀-helices per unit traversal of sequence length using the torsion angle assignment of 3₁₀-helices.
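The overlapping-window scan can be sketched as below: every window of length r, shifted by one residue at a time, yields a local detection rate, alongside the global rate over the whole sequence. The function name and the toy values are hypothetical; in particular, the example uses a very short window so that the output can be inspected by eye, whereas the text requires r > 32.

```python
def windowed_rates(helix_starts, seq_len, r):
    """Local detection rate in each overlapping window of length r (step 1)."""
    if r > seq_len:
        raise ValueError("window length exceeds sequence length")
    marks = [0] * seq_len
    for s in helix_starts:
        marks[s] = 1  # flag residues at which a helix starts
    count = sum(marks[:r])
    rates = [count / r]
    for i in range(1, seq_len - r + 1):
        # Slide by one residue: add the entering residue, drop the leaving one.
        count += marks[i + r - 1] - marks[i - 1]
        rates.append(count / r)
    return rates


# Toy example: 10 residues, helices starting at positions 2 and 7, window of 4.
starts = [2, 7]
local = windowed_rates(starts, seq_len=10, r=4)
overall = len(starts) / 10  # global rate over the whole sequence
print(local, overall)
```

Comparing the spread of the local rates with the global rate, for several values of r, is one simple way to look for the local bias mentioned above.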

The actual number of 3₁₀-helices for every protein was calculated using DSSP; this is denoted by $\bar{\gamma}_x$(DSSP). The $\bar{\gamma}_x$(DSSP) is subsequently equated to $p \cdot (\lambda/\mu)$ to find the correction parameter $p$ for every protein. (A correction parameter is necessary because certain 3₁₀-helices might be missed even with the best efforts to identify them by torsion angle range, and some may be incorrectly assigned by that range.) Using the magnitude of $p$ obtained in the last step, $\bar{\gamma}_x$(torsion) is calculated for the test cases by applying the formula $\lambda p / \mu$. Finally, the magnitudes of $\bar{\gamma}_x$(torsion) and $\bar{\gamma}_x$(DSSP) are compared (for the entire data-set) to test the hypothesis.

Section 2.1.2) Algorithm to observe (statistical) trends in intervals between 3₁₀ and π helices :

Further analyses were carried out to (statistically) model the patterns of sequence intervals between consecutive occurrences of 3₁₀-helices and π-helices. This study was essential because of the inconclusive nature of the findings obtained from the torsion angle based study of the occurrence profile of the 3₁₀-helices. The inter-arrival sequence distances between the observed occurrences of consecutive 3₁₀-helices (from DSSP assignment), for every protein containing a 3₁₀-helix (resolution < 3 Å) in the non-redundant PDB, across all the SCOP classes, were studied. For each SCOP domain, the statistical distribution with the best score of 'χ²-goodness of fit' that models the inter-arrival sequence lengths between 3₁₀-helix occurrences was identified.
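A χ² goodness-of-fit comparison of inter-arrival distances against an exponential distribution might look like the sketch below. It is an illustration only: the binning, the maximum-likelihood rate estimate, the function names and the simulated gaps are all hypothetical choices, and the study itself compared several candidate distributions rather than the exponential alone. Only the test statistic and its degrees of freedom are computed; converting them to a p-value needs a χ² table or a statistics library.

```python
import math
import random


def inter_arrival(starts):
    """Sequence distances between consecutive helix start coordinates."""
    return [b - a for a, b in zip(starts, starts[1:])]


def chi_square_exponential(gaps, bin_edges):
    """Pearson chi-square statistic of gaps against a fitted exponential."""
    n = len(gaps)
    rate = n / sum(gaps)  # maximum-likelihood estimate of the rate
    edges = list(bin_edges) + [math.inf]  # last bin is open-ended
    stat = 0.0
    for lo, hi in zip(edges, edges[1:]):
        observed = sum(lo <= g < hi for g in gaps)
        expected = n * (math.exp(-rate * lo) - math.exp(-rate * hi))
        stat += (observed - expected) ** 2 / expected
    dof = len(bin_edges) - 2  # bins - 1 - one fitted parameter
    return stat, dof


# Hypothetical data: gaps simulated from an exponential with rate 0.05.
rng = random.Random(3)
simulated = [rng.expovariate(0.05) for _ in range(5000)]
edges = [0, 5, 10, 20, 40, 80]  # hypothetical bin edges, in residues
stat, dof = chi_square_exponential(simulated, edges)
print(round(stat, 2), dof)
```

A statistic that is small relative to the degrees of freedom indicates that the exponential model is not rejected for the binned gaps; strongly non-exponential gaps (for example, nearly constant spacing) give a very large statistic.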

Analysis of the property set of statistical distributions often reveals latent trends in a dataset that are not detectable otherwise. Such analysis can be of enormous benefit when the causality behind the occurrence of certain events is difficult to ascertain. Our anticipation regarding the necessity of a rigorous and exhaustive statistical analysis of the occurrence profile properties of 3₁₀-helices, as carried out by the implementation of the aforementioned algorithm, proved to be correct, as the findings (presented in the 'Results and Discussions' section) demonstrate.
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Section 2.2) Phylogenetic Analysis : 

2.2.1) Framework of the work : 

Sequences were aligned and evolutionary distances between each pair of sequences within the families
were computed using MEGA [58]. Distance calculation requires at least one common aligned site in the
multiple sequence alignment. Hence, only sequences of comparable length were considered. Results from
a previous empirical study suggested that the substitution rate usually varies among amino acid sites
during protein evolution; furthermore, this rate variation approximately follows the gamma
distribution [59]. Since the substitution rate also varies with the amino acid pair, we computed JTT
(Jones-Taylor-Thornton) distances [60] for evolutionary distance comparison. To account for the rate
variation among sites, the substitution rate was assumed to follow a gamma distribution with shape
parameter 2.4 [59].
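
To illustrate what a gamma correction does to a pairwise distance, the sketch below uses the simpler
Poisson-correction gamma distance, d = a[(1 - p)^(-1/a) - 1], where p is the proportion of differing
aligned sites and a is the gamma shape parameter. This is a swapped-in simplification for
illustration only: the JTT distances used in this work rely on an empirical substitution matrix as
implemented in MEGA, which is not reproduced here. The function names are hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over aligned sites where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != gap and y != gap]
    if not pairs:
        # Mirrors the requirement of at least one common aligned site.
        raise ValueError("no common aligned site")
    return sum(x != y for x, y in pairs) / len(pairs)


def gamma_distance(p, shape=2.4):
    """Poisson-correction distance with gamma-distributed rates among sites.

    `shape` defaults to 2.4, the value quoted in the text; this is NOT the JTT model.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return shape * ((1.0 - p) ** (-1.0 / shape) - 1.0)
```

As the shape parameter grows, rate variation among sites vanishes and the value approaches the plain
Poisson-correction distance, -ln(1 - p).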

Distance comparison analysis yielded a symmetric matrix of pairwise evolutionary distances for each
family. For each pair of proteins in a family (that is, each entry of the upper triangle of the
evolutionary distance matrix, excluding the diagonal), the number of substitutions of any 3₁₀-helix
with any other secondary structure in the multiple sequence alignment of the proteins was counted.
A suitable substitution criterion was decided after an extensive survey of the evolutionary distances
and alignments (detailed in the next section).

2.2.2) Determination of substitution criterion : 

One of the two sequences in the pair under consideration was chosen as the reference sequence.
Locations of the 3₁₀-helices (assigned by DSSP) were mapped onto the sequence alignment. If a
3₁₀-helix in the reference sequence occurred within a 3-residue range of a 3₁₀-helix in the other
sequence, the two were considered to be on the same sequence coordinates. This 3-residue buffer was
used merely to account for (possible) insertions/deletions in another region that might affect the
position of the 3₁₀-helix under consideration. The absence of a matching 3₁₀-helix within this
3-residue range can be due either to the replacement of a 3₁₀-helix or to the incorporation of a new
3₁₀-helix in one of the sequences. The decision between these two possibilities was taken on the
basis of a thorough study of evolutionary distance. Trends from the obtained results suggested that
with increasing evolutionary distance, the tendency to lose a 3₁₀-helix in a pair of aligned proteins
also increases. From a comprehensive survey of various distances, we found that the mean evolutionary
distance can be used as a consistent and efficient parameter to choose between replacement and new
RSS formation. Accordingly, when the evolutionary distance of the pair under consideration was greater
than the mean distance for that family, the event was considered a replacement; when it was less, an
insertion of a new 3₁₀-helix was registered. To avoid multiple comparisons of the same 3₁₀-helices,
comparisons were limited to 3₁₀-helices situated nearby on sequence coordinates. This procedure was
repeated for all the pairs of proteins in the top 20 Pfam families in which at least one of the
sequences contains at least one 3₁₀-helix.
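
The decision rule described above can be sketched for a single aligned pair as follows. This is a
hedged illustration, not the authors' code: the function name, the representation of helices as
alignment-column positions, the treatment of unmatched helices from either sequence by the same rule,
and the handling of a distance exactly equal to the family mean (counted here as an insertion) are
all assumptions.

```python
def classify_helix_events(ref_helices, other_helices, pair_distance,
                          family_mean_distance, buffer=3):
    """Count retained (Retn), replaced (Rplc) and inserted (insr) 3-10 helices for one pair.

    ref_helices / other_helices: alignment-column positions of 3-10 helices
    (hypothetical representation) in the reference and the other sequence.
    """
    counts = {"Retn": 0, "Rplc": 0, "insr": 0}
    remaining = sorted(other_helices)
    unmatched = 0
    for pos in sorted(ref_helices):
        # A helix within `buffer` residues is treated as the same helix.
        match = next((q for q in remaining if abs(q - pos) <= buffer), None)
        if match is not None:
            counts["Retn"] += 1
            remaining.remove(match)  # each helix takes part in one comparison only
        else:
            unmatched += 1
    # Helices found in only one of the two sequences (assumption: both directions count).
    unmatched += len(remaining)
    # Distance above the family mean -> replacement; otherwise -> new insertion.
    label = "Rplc" if pair_distance > family_mean_distance else "insr"
    counts[label] += unmatched
    return counts
```

For example, with reference helices at columns 10 and 40 and helices in the other sequence at columns
12 and 80, the first pair is retained and the remaining two helices are counted as replacements when
the pair distance exceeds the family mean, and as insertions otherwise.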

The algorithm described above counts and characterizes the substitutions of a rare secondary structure
(namely the 3₁₀- and π-helix in the present case, but it can also be applied to the poly-proline
helix, β-bulge, α-turn, β-turn, γ-turn, π-turn, ω-loop, etc.) in sequence space. We devised this
strategy because we could not find any study in which this kind of rigorous evolutionary analysis had
been performed. Although related analyses of sequence/structure evolution have been carried out
earlier, the motivations for those studies differed notably from the scope and depth of the present
problem. Most importantly, a methodology that investigates similarity between entire 3D-domains across
major Pfam sequences on a statistical scale, building upon thorough knowledge of the effects of
evolution on secondary structures, was not found in existing works.

Results and Discussions :

Section-1 :

Results obtained from Methodology Section 2.1.1 with Discussion : 

The results of the comparison (between the value of γ_x (mean of the distribution) predicted by our
algorithm and the values obtained from DSSP) are extremely interesting. These results (fig. 2)a-2)g)
show the trend of the γ_x values predicted by our algorithm. The ordinate in these figures describes
the error in prediction (in probability units, between 0.00 and 1.00), whereas the difference in
abscissa describes the absolute number of cases in which a particular error has occurred. (The sorted
profile of these numbers, when used as axes, therefore describes how much error was committed, and
how many times, for a particular SCOP class.) While these trends can be observed to distinctly follow
the actual occurrence profile of 3₁₀-helices in proteins, they consistently under-predict the γ_x
magnitude for 3₁₀-helices. A careful observation of the ordinate magnitudes shows that the maximum
margin of error for all-α and α/β proteins is 0.09 (probability) unit; for membrane proteins and
small proteins it is 0.08 (probability) unit; for α+β proteins it is smaller still, 0.06
(probability) unit; while for multidomain proteins the error is 0.05 (probability) unit. When applied
to π-helices, this algorithm could predict the γ_x magnitudes with high reliability (maximum error
margin < 0.025 (probability) unit). Still, striving for further accuracy, we wanted to ascertain the
reason for the (small margin of) under-prediction. For this investigation, we manually selected, at
random, 20% of the proteins from our dataset and repeated the experiment. This examination revealed
that, owing to the tremendous variation in the range of torsion angles in the 3₁₀-helices (cutting
across the SCOP classes), the filtering criterion for detection of 3₁₀-helices (based on torsion
angles) fell short in identifying them. Position-specific shifts in the φ-ψ of the 3₁₀-helices in
proteins, reported elsewhere [40], support this argument. (Nonetheless, we could not find in the
existing literature any algorithm that detects 3₁₀-helices more efficiently than our procedure.)
Hence our program could not detect ~2% of the residues belonging to 3₁₀-helices in the multidomain
proteins, ~4% in the α+β and membrane proteins, and (7-9)% in the α/β, all-α and small proteins.
This consistent error in the detection of residues belonging to 3₁₀-helices explains the small yet
measurable differences between the 'predicted' and 'real' trends in fig. 2)a-2)g.

Leaving aside speculation about how closely the predicted occurrence profile of 3₁₀-helices would
have matched that in protein crystal structures had the torsion angle based detection scheme for
3₁₀-helices worked properly, certain subtle trends can clearly be noticed in the obtained results
(fig. 2)a-2)g). The results, without the torsion angle related correction, tend to suggest that the
basic pattern of the occurrence of 3₁₀-helices on the primary structures can be reliably reproduced,
although the extents of such occurrences might not match. Reproduction of the basic trend in their
occurrence (extremely well for multidomain proteins; moderately well for α+β and membrane proteins;
somewhat poorly for all-α, α/β and small proteins) opens up a debate with two possible
explanations, viz. :

1) Some minute aspects of pure Poisson flow might not be pertinent when attempting to understand
nature's plan to place 3₁₀-helices on the primary structure. This implies that although in some
proteins 3₁₀-helices occur as accidents and can well be considered fossils of the folding pathway,
exemplifying nature's faulty (yet quickly rectified) plan, in some other proteins nature perhaps
associates some subtle yet definite plan with the construction and placement of 3₁₀-helices on the
primary structure. While the first part of this argument forms the basis of the present work, the tiny
deviations of the predicted magnitudes from the observed ones (for all-α, α/β and small proteins) can
possibly be explained by the second part.

2) If the same calculations are performed with an algorithm that detects 3₁₀-helices reliably (as and
when one is constructed), the predicted trends will converge to the actually observed trends,
suggesting pointedly that the occurrences of 3₁₀-helices on the primary structure are indeed
accidental and rare; this, in turn, would establish the hypothesis of possible mistakes of nature
while making the protein fold, with 3₁₀ and π-helices as prime examples.

The results presented under 'Section-2' of the Results resolve this debate.

Surprisingly, the detection scheme for π-helices did not suffer from such errors. The distribution
of π-helices on the sequence axis (fig. 2)g) provided a clear indication that their occurrences take
place purely by chance and at random. As explained in the introduction section, such an observation
strongly indicates that nature (probably) does not attach priority to the construction of π-helices.
The Poisson occurrence of π-helices implies (almost definitively) that nature does not have a
categorical plan for what to do with them (either structurally or functionally) and that they
exemplify the errors committed by nature while making the primary structure fold. Probably the
π-helices are true fossils of nature's plan while constructing α-helices.

Section-2 : 

Results obtained from Methodology Section 2.1.2 with Discussion : 

Although the principal cause of our algorithm's subtle yet tangible under-prediction of 3₁₀-helices
could be established, in order to verify the basic hypothesis (that is, nature did not attach much
importance to the construction of 3₁₀-helices and π-helices, and their occurrences probably depict
nature's mistakes and not her plan) we undertook a study to probabilistically model the patterns of
sequence intervals between consecutive occurrences of 3₁₀-helices and π-helices across all the SCOP
domains of all the protein crystal structures (resolution < 3 Å) contained in the non-redundant PDB
set, classified as per SCOP classes. This study was significant because the findings obtained from
the torsion angle based study on the occurrence profile of the 3₁₀-helices were inconclusive.

The study of the distribution of sequence intervals between primary structure coordinates of
3₁₀-helix occurrences presents us with an unexpected (given the existing knowledge of protein
structure) and enormously interesting set of data (described in Table-1). The distribution for all-α
and membrane proteins had its χ² best-fit with the Johnson-SB distribution; for α/β, α+β and
multidomain proteins, the Weibull and exponential distributions served as ideal templates; whereas for
the small proteins, it was the log-Weibull distribution. The implications of these best-fit results
are unexpected and deep. At the same time, they provide an ideal construct to validate the data
presented in Section-1 of the Results.



Class of Proteins            Statistical distributions
-----------------            -------------------------
All-α                        Johnson-SB distribution
Membrane                     Johnson-SB distribution
α + β                        Exponential distribution
α/β                          Weibull distribution
Multi-domain                 Weibull distribution
Small                        Exponential distribution
π-helices (all classes)      Pareto distribution

Table-1) : χ² best-fit distribution to model inter-arrival sequence length between consecutive
3₁₀/π-helices.

The exponential distribution routinely describes the lengths of the inter-arrival durations in a
homogeneous Poisson process [44-46], and it is a special case of the Weibull distribution. The
exponential distribution's role as the χ² best-fit template for the α+β domain of proteins and the
third best fit for small proteins points unambiguously towards a Poisson profile of occurrence of
3₁₀-helices in the primary structure of α+β and small proteins. The Weibull distribution describes
the phenomena space in between that of the exponential distribution and the Rayleigh distribution
[61,62], both of which are best known for their description of Poisson process related phenomena. The
fact that α/β and multidomain proteins have the Weibull distribution as the best-fit template for χ²
goodness of fit suggests, in an indirect (yet affirmative) way, that the pattern of 3₁₀-helix
occurrence on the primary structure in these structural families is related to the Poisson school of
statistical distributions. These results further imply that the differences between the 'predicted'
and 'experimentally observed' worms in figures [3-8] are purely due to the failure to detect
3₁₀-helices in the primary structures and not due to erroneous and/or simplistic aspects of our basic
hypothesis. (The cause of the 3₁₀-helix detection failure with the torsion angle based algorithm has
been discussed above.)
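
For reference, the relationship among the three distributions named in this paragraph can be written
out explicitly. The parameterization below (shape k, scale λ) is the standard textbook form and is our
notation, not one taken from the fitting performed in this work:

```latex
\begin{align}
  f_{\mathrm{Weibull}}(x; k, \lambda)
    &= \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}
       e^{-(x/\lambda)^{k}}, \qquad x \ge 0, \\
  k = 1:\quad f(x)
    &= \frac{1}{\lambda}\, e^{-x/\lambda}
       \quad \text{(exponential, mean } \lambda\text{)}, \\
  k = 2:\quad f(x)
    &= \frac{2x}{\lambda^{2}}\, e^{-(x/\lambda)^{2}}
       \quad \text{(Rayleigh, with } \sigma = \lambda/\sqrt{2}\text{)}.
\end{align}
```

A fitted Weibull shape close to 1 therefore corresponds to the exponential inter-arrival lengths of a
homogeneous Poisson process, while shapes between 1 and 2 lie between the exponential and Rayleigh
cases.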

However, (in the midst of all these Poisson-school random and rare occurrences) the distribution
profiles for all-α and membrane proteins (χ² best-fit template 'Johnson SB', a distribution closely
related to the classical normal distribution) point clearly to some prudent plan of nature behind
constructing and placing the 3₁₀-helices in these two structural families. Although an exact outline
of this plan is difficult for us to envisage, the internal organization of helices within all-α and
membrane proteins provides significant clues. The fact that the predicted occurrence profile of
3₁₀-helices differed markedly from the observed one for all-α and membrane proteins (fig-3.a and
fig-3.d, respectively) finds support in these findings. It is interesting to note that a number of
studies [63-65] have hinted at the presence of an environment conducive to helical structures in
membrane proteins. Membrane proteins constitute a structural class in which there are fewer
opportunities to destabilize the helical hydrogen bonds. Hence, even if the preferred hydrogen bonding
pattern, viz. [(i + 4) → i], is not satisfied, the (strained) [(i + 3) → i] hydrogen bonding pattern
amongst residues can well be accommodated within the membrane environment. On the other hand, the
constraint of using solely α-helical structures (3.6 residues per turn with a translation of 1.5 Å per
residue along the helix axis, hydrogen bonding pattern [(i + 4) → i]) as building blocks for the all-α
proteins can be daunting under all circumstances, and the transition to 3₁₀-helices (3 residues per
turn with a translation of 2 Å per residue along the helix axis, hydrogen bonding pattern
[(i + 3) → i]) might not be extremely difficult to accommodate for short stretches of sequence.
A non-trivial probability can therefore be attached to the formation of 3₁₀-helices within all-α
domains, which may well show non-random and non-rare characteristics. Thus, although mutually
different, the causes behind all-α and membrane proteins not following the Poisson family of
distributions (with respect to the occurrence profile of 3₁₀-helices in them) can well be understood
within the framework of the logic proposed in the present work. These results tend to imply that the
lack of detection accuracy (~8% for all-α and ~4% for membrane proteins) is not the sole reason why
the 3₁₀-helix occurrence profile in these two structural classes deviates from the Poisson school of
reasoning that governs the same in other classes.

A (near) perfect χ² goodness-of-fit result for the sequence-interval profile of the π-helices could
be obtained with the Pareto distribution of the second kind (Lomax). Since the Pareto distribution is
a power-law distribution, it essentially models a stochastic process. Hence the occurrences of
π-helices in the primary structures tend to suggest that a stochastic process, instead of a
deterministic process, can best describe their occurrence profiles. This finding bolsters our
assertion (stated earlier) about them.
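
The power-law character mentioned above is easiest to see in the tail of the distribution. In the
standard form of the Lomax distribution (shape α and scale λ are textbook symbols, not parameters
reported in this work), the density and the survival function are:

```latex
\begin{align}
  f_{\mathrm{Lomax}}(x; \alpha, \lambda)
    &= \frac{\alpha}{\lambda}\left(1 + \frac{x}{\lambda}\right)^{-(\alpha + 1)},
       \qquad x \ge 0, \\
  S_{\mathrm{Lomax}}(x)
    &= \left(1 + \frac{x}{\lambda}\right)^{-\alpha}
       \;\sim\; \left(\frac{x}{\lambda}\right)^{-\alpha}
       \quad \text{for } x \gg \lambda, \\
  S_{\mathrm{exp}}(x)
    &= e^{-x/\mu}.
\end{align}
```

The Lomax tail decays polynomially, whereas the exponential tail decays exponentially, so long
sequence intervals between consecutive helices are far more probable under the Lomax fit than under
an exponential one with a comparable scale.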

To summarize the results and, at the same time, to compare and contrast them with the premise of the
starting hypothesis, we enumerate the principal findings :

1) The number of residues forming part of 3₁₀-helices is less than 3% of the total number of protein
residues in non-redundant PDB structures.

2) Not a single study could be found that reports a statistically significant number of cases (even
as few as five) in which a 3₁₀-helix performs a significant role in either stabilizing the protein or
performing a tangible function where an α-helix would have failed to do so.

3) The occurrence profile of 3₁₀-helices on the primary structure has been found to be irregular and
non-repetitive, to the extent of being almost random and rare for some protein structural classes and
completely random and rare for others.

4) The torsion angle range for 3₁₀-helices is far from structured.

5) 3₁₀-helices are typically short, 3-5 residues long on average.

Indeed, all these arguments can be put forward (with even greater force) to describe the occurrence
of π-helices.



Section-3 :

Results from phylogenetic analysis (Methodology section 2.2) with Discussion : 

The hypothesis was put to the test on the top 20 Pfam families. The Pfam database links protein
sequence profiles with their structures wherever available. For two out of the twenty families
(PF01535, PF00361) structures were not available, while for another (PF07690) the number of
structures was inadequate to perform any statistical test. For six families (PF00077, PF00516,
PF00560, PF00023, PF00400, PF06817), the mean length was found to be very small and no 3₁₀ or π-helix
could be identified. Hence the analysis was performed on the rest (PF00005, PF00078, PF00115, PF00072,
PF00033, PF02518, PF00528, PF00069, PF00032, PF00106, PF00583).

Although the details of the results are provided in fig-4, fig-5.a, fig-5.b, and in Suppl. Mat. 1&2,
here we report the significant trends. In six out of the eleven families (PF00005, PF00115, PF00072,
PF02518, PF00069, PF00106, PF00583), the events of retainment (Retn) of a 3₁₀-helix were found to be
fewer than both the events of their replacement (Rplc) and the events of their new insertion (insr);
no case was observed in which (Retn) is greater than both (Rplc) and (insr); while for three
(PF00005, PF00033, PF00032) out of the eleven families (Rplc) outnumbered both (Retn) and (insr).
These are significant findings (unforeseen hitherto). Together, they show nature's apathy towards the
retainment of 3₁₀-helices across sequences of all the Pfam families of interest. The other
observation, that 3₁₀ replacements outnumber 3₁₀ retainments and 3₁₀ insertions in a non-trivial
number of cases, takes the non-retainment trend to its logical conclusion. (Managers do not merely
want an inefficient employee sacked; they want him replaced by a better one.) Hence the results
obtained from this part of the analysis are in complete accordance with our hypothesis that
3₁₀-helices exist as nature's faults during protein folding. Interestingly, no correlation was found
between SCOP classes and the trends in the occurrence of any one of (Rplc), (Retn) and (insr). For
example, while (Rplc) > (Retn) > (insr) is observed for PF00005 (the ABC transporter, with structural
domain α/β), for PF00106 (short chain dehydrogenase), with the same structural domain α/β, the trend
is partly reversed, viz. (Retn) < (Rplc) < (insr). The absence of any particular pattern confirms
that the trends reported earlier in this paragraph are completely general and hence vindicates our
hypothesis on a global scale.

But results from the same analysis also reveal that in seven out of the eleven classes (PF00078,
PF00115, PF00072, PF02518, PF00069, PF00106, PF00583) the (insr) magnitude is greater than the (Rplc)
magnitude ((Retn), in any case, is lower than either of them). Although the maximum is a staggering
9231 out of 12795 cases for the protein kinase domain (α+β), for four out of the seven Pfam classes
the trend (Rplc) <~ (insr) could be observed. Quite unambiguously, this is in contradiction to our
hypothesis. However, an in-depth literature search and further analysis of the data resolve the
crisis. Evolution is not an event but a process. That evolution does not proceed at a constant rate,
but depends upon various types of environmental selective pressures on the individuals of any species
at any given time, has been established conclusively in a number of recent works [66-68]. That one
kind of secondary structure may experience replacements at a higher rate than another is known too
[69]. That being the case, evolution can well be compared to a pipeline in a workshop, where the
supply line can contain (occasional) faulty materials too; all that is known is that, during the
formation of the finished product, these faulty elements will either be corrected to efficient ones or
be thrown out of any further consideration. Viewing the 3₁₀-helices from this perspective solves all
the riddles in one unified way. Hence, in the present context, we hypothesized that while many
3₁₀-helices are constantly being inducted into the quality control pipeline (that is, the process of
evolution), many of them will either be replaced by α-helices or loops, or else be disposed of during
the process. Selecting the probable secondary structures for this hypothesis wasn't difficult: in
order to release the steric constraints, it is easier for a 3₁₀-helix to transform itself into either
an α-helix or a loop than to undergo a large-scale rearrangement to achieve the same in the form of a
β-sheet. It is interesting to note that although there is no disallowed region of the Ramachandran
map between the preferred φ-ψ magnitudes for 3₁₀ and α-helices [70], the 3₁₀ block of φ-ψ is less
favorable [71]. This smart quality control mechanism ensures a smooth 3₁₀ → α transition, but opposes
the α → 3₁₀ transition. Our assertion from the evolutionary analysis, viz. that 3₁₀-helices are
unrealized possibilities on their route to becoming α-helices, finds support in the findings of some
recent MD studies, where 3₁₀ and π-helices are typified as "transient" and "defective" α-helices
[72-74].

Results obtained from this analysis (Table#) vindicated our hypothesis completely. In all the eleven
Pfam classes, 3₁₀-helices are found to transform into either α-helices or loops. A dominant 3₁₀ → α
transformation was found in five out of the eleven Pfam classes, whereas for another five a dominant
3₁₀ → loop transformation could be observed. In a single case (PF00072, response regulator receiver
domain, possessing, interestingly, an all-β structural domain) 3₁₀-helices were observed to prefer
α-helices and loops equally.



Conclusion : 

We have shown here that results from three different and rigorous investigations, with contrasting
algorithms, converge to confirm our basic hypothesis; namely, that nature does not attach any
well-defined plan to the construction of 3₁₀ and π-helices. We think that the apparently disparate
array of observations regarding various known facts about 3₁₀-helices (detailed in the introduction
section) can be explained from a common platform if it is hypothesized that they symbolize nature's
mistakes during its attempt to make the primary structure fold to its native state. Of course these
mistakes take place rarely, which explains the rare occurrence of 3₁₀-helices. Going back to an
analogy used previously, one can view nature as a manager in a hurry while making the primary
structure fold; probably that is why the temptation of constructing the first hydrogen bond with one
fewer unit than that required to construct an α-helix gets the better of him. But before long, the
smart manager realizes his (her, at any rate) mistake. So he drops the plan to persist with the
mistake, and hence the faulty scheme of hydrogen bonds is stopped from becoming longer. Since no
particular plan was associated with the construction of these structures, no notable example of their
involvement in either stabilizing the protein or helping it with a certain functionality can be found
either. The Ramachandran map for these structures, not surprisingly, demonstrates the widest possible
spectrum of φ-ψ combinations. Such (wild) variability is never observed in the φ-ψ range of α-helices
or β-sheets, structures to which nature attaches definite importance in lending proteins stability
and/or functionality. The very fact that, barring ignorable tendencies in a minuscule number of cases,
the occurrence profiles of 3₁₀-helices across all the protein structural families tend to follow a
Poisson family of distributions suggests that their occurrence in the primary structure is indeed
random and rare; that is, without a specific plan. Finally, extensive evolutionary analysis shows that
the new insertions of 3₁₀-helices do not contradict our hypothesis, that the top Pfam classes show no
interest in retaining the 3₁₀-helices, and that nature makes it a point to transform 3₁₀-helices into
α-helices and loops. This entire body of evidence, taken as a whole, seems to point to the definite
(non-accidental) inference that nature commits mistakes when constructing the 3₁₀ and π-helices.

We attempted to address a basic question in this work. Instead of justifying the stance of nature
on the construction and placement of 3₁₀ and π-helices with isolated cases and accidental
observations that conform to no general pattern, we asked: with what probability can we assert that
they exemplify nature's mistakes while performing protein folding? It turned out that our (somewhat
sacrilegious) question did find reasonable ground when describing the occurrence profile of
3₁₀-helices and π-helices on the primary structure. The dominant trend of the results across the
various structural classes, from statistical, mathematical and evolutionary aspects, suggested that
these (rare) secondary structures should be viewed as evidence of nature's mistakes with regard to
making the primary structure fold to the native state of a folded protein.

Figure Title 1)a) : The Ramachandran Map of 3₁₀-helices.

Legend for Figure 1)a) : Ramachandran map of all the 3₁₀-helices present in the non-redundant PDB.

Figure Title 1)b) : The Ramachandran Map of π-helices.

Legend for Figure 1)b) : Ramachandran map of all the π-helices present in the non-redundant PDB.

Figure Title 2) : Observed-vs-Predicted occurrence of 3₁₀ and π-helices

Legend for Figure 2 : For the concerned class of protein, the number of 3₁₀ and π-helices in its
crystal structures was counted. For two points chosen arbitrarily, the graph can be interpreted as
the difference in probability with which a certain number of 3₁₀-helices were predicted and a certain
number were observed. For example, two points with ordinate 0.02 and abscissae 33 (predicted) and 43
(observed) imply that there are 10 (43-33=10) cases in which the difference between the observed
probability of occurrence of 3₁₀-helices and the predicted probability of occurrence of 3₁₀-helices
is 0.02 probability units. In other words, the ordinate describes a particular magnitude of
probability, whereas the abscissa denotes the number of cases with that particular probability. The
study was carried out on all the non-redundant PDB structures across all the SCOP classes.

Figure Title 3) : Modeling the inter-arrival interval for 3₁₀ and π-helices on all the primary
structures

Legend for Figure 3 : The occurrence of 3₁₀-helices on the primary structure does not follow a
regular and repetitive pattern. In fact, these occurrences appear to be random. The sequence
intervals between consecutive occurrences of 3₁₀-helices (abscissa) are plotted against the number of
3₁₀-helices with a particular sequence-interval length (ordinate). The absolute number of 3₁₀-helices
with a certain sequence interval was normalized by the maximum length of a primary structure with at
least two 3₁₀-helices in the entire non-redundant PDB. Plots have been generated for every SCOP class
containing 3₁₀-helices. Data fitting was done with the software 'R'.

Figure Title 4) : Distribution of replacement, retention and insertion across evolutionary distance

Legend for Figure 4 : Distributions of substitution, retention and insertion over evolutionary
distances across families showed that the absolute numbers of replacements and new 3₁₀-helix
formations are almost the same. However, their trends show that at small evolutionary distances
(< 4-5), the numbers of retained and newly formed 3₁₀-helices are greater than the number of
substitutions, with the number of new formations being significantly higher. But as the evolutionary
distance increases, the numbers of new and retained 3₁₀-helices decrease drastically and the number
of substitutions increases. This result validates our methodology (to count substitutions and
replacements) and our hypothesis, as it is clear that events of retainment decrease drastically with
increasing evolutionary distance.

Figure Title 5) : Results of evolutionary analysis

Legend for Figure 5)a) : Describes the number of cases of replacement, retention and insertion of
3₁₀-helices across 20 top Pfam classes.

Legend for Figure 5)b) : Surveying the replacement cases further, this figure describes the cases
where 3₁₀-helices are being replaced by α-helices, β-sheets and loops, across 20 top Pfam classes.